In this SQL project, I use sql to clean data that can later be used for further analysis. Here is a step to step walk through of the project.
I started by creating a new database and importing the data.
I then had an overall overview of the data to understand it and to determine the what, where and how it would be cleaned.
I then standardized the date column. This involved:
I then went ahead to populate the missing values in the property address column. I observed that for the missing values, there was a similar row entry with the same parcel ID but different Unique IDs. So the next step was to populate the missing property address with the available property address of the other entry with the same parcel ID as they should be similar.
I then went ahead to split the property Address Column into the actual address and the associated city which are in one column.
I then split the owner Address Column
The next step was to standardize the SoldAsVacant column. The first step was to find out the values present and their count. This would give us a basis on how to standardize it based on the entries with the highest count. Since 'Yes' and 'No' have the highest count, we convert the Y and N into Yes and No respectively. This also makes it more understandable as Y and N can be confusing and open to misinterpretation on what they could mean.
I then went ahead to remove duplicates in our value. This however is not standard practice and should be done with care. Alternatively, creating a working sheet from the original data for such manipulation is advisable while reserving the original data as it is.
Finally, I finished off by deleting unused columns. This columns are Owner Address, TaxDistrict, PropertyAdress and SaleDate columns.
To download and view the full project on GitHub, click here.